12/9/23
Let's first construct the data Guerry analysed.
library(tidyverse)

# personal crimes
tibble(
  Year = 1825:1830,
  North = c(25, 24, 23, 26, 25, 24),
  South = c(28, 26, 22, 23, 25, 23),
  East = c(17, 21, 19, 20, 19, 19),
  West = c(18, 16, 21, 17, 17, 16),
  Central = c(12, 13, 15, 14, 14, 18)
)
# A tibble: 6 × 6
   Year North South  East  West Central
  <int> <dbl> <dbl> <dbl> <dbl>   <dbl>
1  1825    25    28    17    18      12
2  1826    24    26    21    16      13
3  1827    23    22    19    21      15
4  1828    26    23    20    17      14
5  1829    25    25    19    17      14
6  1830    24    23    19    16      18
# property crimes
tibble(
  Year = 1825:1830,
  North = c(41, 42, 42, 43, 44, 44),
  South = c(12, 11, 11, 12, 12, 11),
  East = c(18, 16, 17, 16, 14, 15),
  West = c(17, 19, 19, 17, 17, 17),
  Central = c(12, 12, 11, 12, 13, 13)
)
# A tibble: 6 × 6
   Year North South  East  West Central
  <int> <dbl> <dbl> <dbl> <dbl>   <dbl>
1  1825    41    12    18    17      12
2  1826    42    11    16    19      12
3  1827    42    11    17    19      11
4  1828    43    12    16    17      12
5  1829    44    12    14    17      13
6  1830    44    11    15    17      13
Next, we build the dataset that we will analyze: monthly figures for crimes against persons and against property.
guerry <-
  tibble(
    # Month as an ordered factor, Jan to Dec
    Month =
      factor(
        format(ISOdate(1833, 1:12, 1), "%b"),
        levels = format(ISOdate(1833, 1:12, 1), "%b")
      ),
    Person = c(
      69, 70, 85, 78, 92, 99,
      89, 95, 88, 75, 78, 82
    ),
    Property = c(
      96, 81, 84, 75, 77, 78,
      71, 82, 80, 85, 89, 102
    )
  )

# With no `wt` argument, top_n() ranks by the last column (Property)
guerry |>
  top_n(4)
# A tibble: 4 × 3
  Month Person Property
  <fct>  <dbl>    <dbl>
1 Jan       69       96
2 Oct       75       85
3 Nov       78       89
4 Dec       82      102
It appears that crimes against persons are more frequent in summer, while property crimes are more frequent in winter.
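A quick way to check this impression is to average the monthly figures within each season. The sketch below uses base R only and copies the numbers from the `guerry` tibble; the choice of June to August as summer and December to February as winter is an assumption made for illustration, not a definition taken from the source.

```r
# Monthly figures copied from the `guerry` tibble above
month    <- month.abb
person   <- c(69, 70, 85, 78, 92, 99, 89, 95, 88, 75, 78, 82)
property <- c(96, 81, 84, 75, 77, 78, 71, 82, 80, 85, 89, 102)

# Assumption: meteorological seasons (illustrative choice)
summer <- month %in% c("Jun", "Jul", "Aug")
winter <- month %in% c("Dec", "Jan", "Feb")

# Seasonal means, one row per crime type
season_means <- rbind(
  Person   = c(summer = mean(person[summer]),   winter = mean(person[winter])),
  Property = c(summer = mean(property[summer]), winter = mean(property[winter]))
)
round(season_means, 1)
```

Under that assumption the summer mean for crimes against persons is about 94 against about 74 in winter, while for property crimes the means are 77 and 93.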
In fact, the seasonal pattern is well approximated by a sine curve.
guerry_fit <-
  guerry |>
  pivot_longer(Person:Property, names_to = "type", values_to = "rate") |>
  filter(type == "Person") |>
  mutate(x = as.numeric(Month)) %>%
  # the magrittr pipe is needed here for the `data = .` placeholder
  lm(rate ~ cos(x * 2 * pi / 12) + sin(x * 2 * pi / 12), data = .) |>
  coef() |>
  unname()

# guerry_fit holds the intercept, the cos coefficient, and the sin coefficient, in that order
tibble(
  "$alpha_2$" = guerry_fit[2],
  "$alpha_3$" = guerry_fit[3],
  "$fi$" = atan(guerry_fit[3] / guerry_fit[2]),
  "$A$" = sqrt(guerry_fit[2]^2 + guerry_fit[3]^2)
)
# A tibble: 1 × 4
  `$alpha_2$` `$alpha_3$` `$fi$` `$A$`
        <dbl>       <dbl>  <dbl> <dbl>
1       -10.1       -4.18  0.393  10.9
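One caveat: `atan()` of a ratio determines the phase only up to a multiple of \(\pi\), because the ratio alone does not say which quadrant the pair of coefficients lies in, and here both coefficients are negative. The sketch below, in base R, recovers amplitude and phase with `atan2()` instead. It assumes the parameterisation rate \(= B + A \sin(2\pi x/12 + \phi)\), under which the sine coefficient equals \(A\cos\phi\) and the cosine coefficient equals \(A\sin\phi\); the names `b_cos`, `b_sin`, `A`, `phi`, and `rebuilt` are illustrative and not part of the original analysis.

```r
# Person column of the `guerry` data, months 1 to 12
person <- c(69, 70, 85, 78, 92, 99, 89, 95, 88, 75, 78, 82)
x <- 1:12
w <- 2 * pi / 12                       # one cycle per 12 months

fit <- lm(person ~ cos(w * x) + sin(w * x))
b <- unname(coef(fit))                 # intercept, cos coefficient, sin coefficient
b_cos <- b[2]
b_sin <- b[3]

# Assumed model: B + A * sin(w * x + phi), so b_sin = A cos(phi), b_cos = A sin(phi)
A   <- sqrt(b_cos^2 + b_sin^2)
phi <- atan2(b_cos, b_sin)             # atan2(y, x) with y = A sin(phi), x = A cos(phi)

# The single sine curve should reproduce the fitted values of the regression
rebuilt <- b[1] + A * sin(w * x + phi)
all.equal(rebuilt, unname(fitted(fit)))
```

The amplitude agrees with the value reported above (about 10.9); the phase differs from the `atan()` value because it is expressed for a sine rather than a cosine and is placed in the correct quadrant.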
We can see how this curve looks when overlaid on the observed data.
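One possible way to draw such an overlay is sketched here with base R graphics rather than the plotting code of the original post, which is not shown. The object names `fit` and `grid` and the axis labels are illustrative choices.

```r
# Person column of the `guerry` data, months 1 to 12
person <- c(69, 70, 85, 78, 92, 99, 89, 95, 88, 75, 78, 82)
x <- 1:12
fit <- lm(person ~ cos(2 * pi * x / 12) + sin(2 * pi * x / 12))

# Evaluate the fitted curve on a fine grid so that it is drawn smoothly
grid <- data.frame(x = seq(1, 12, length.out = 100))
grid$rate <- predict(fit, newdata = grid)

# Observed points first, then the fitted sinusoid on top
plot(x, person, xaxt = "n", xlab = "Month", ylab = "Person", pch = 19)
axis(1, at = x, labels = month.abb)
lines(grid$x, grid$rate)
```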
A quick example shows why a transformation is needed: a straight-line regression is unable to capture the clear periodic structure in the data.
We can get around this by transforming the data and then fitting a linear regression.
If we use a sinusoidal curve to do this, it is parameterized by the amplitude, frequency, and phase.
After fitting this model we can look at the fit to the data. The fitted curve closely recovers the periodic component.
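The contrast can be reproduced on simulated data. In this minimal sketch every number (seed, series length, mean level, amplitude, phase, noise level) is made up for illustration, and the frequency `omega` is assumed to be known in advance rather than estimated.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility
t_obs <- 1:48
omega <- 1 / 12                              # assumed known: one cycle per 12 steps

# Simulated series: constant level + sinusoid + Gaussian noise (all values made up)
y <- 50 + 10 * sin(2 * pi * omega * t_obs + 1) + rnorm(length(t_obs), sd = 2)

# Straight line in time versus regression on the sine and cosine transforms
line_fit     <- lm(y ~ t_obs)
harmonic_fit <- lm(y ~ sin(2 * pi * omega * t_obs) + cos(2 * pi * omega * t_obs))

# Share of variance explained by each model
c(
  line     = summary(line_fit)$r.squared,
  harmonic = summary(harmonic_fit)$r.squared
)
```

The straight line explains very little of the variation, while the regression on the two transformed variables explains most of it, even though both models are fitted with `lm()`.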
Given some knowledge of trigonometry, the above curve looks something like a sine or cosine function. One way to model it is to fit the data with a sine curve \[Y_t = A \sin(2\pi\omega t + \phi) + B.\]
The parameters have a direct interpretation, but as written the model is not linear in \(A\) and \(\phi\), so it is not yet in the form of a linear regression.
To do that, we make use of the trigonometric identity \[\sin(\alpha+\beta) = \sin \alpha\cos \beta + \cos \alpha \sin \beta.\]
Using this with \(\alpha = 2\pi\omega t\) and \(\beta = \phi\), we get \[ Y_t = A \cos \phi \sin (2 \pi \omega t) + A \sin \phi \cos(2 \pi \omega t) + B.\]
Letting \[ X_1 = \sin (2 \pi \omega t),\; X_2 = \cos (2\pi \omega t),\] \[\alpha_1 = A \cos \phi,\; \alpha_2 = A \sin \phi,\] we have \[ Y_t = \alpha_1 X_1 + \alpha_2 X_2 + B.\]
This is a linear regression model in the new variables \(X_1\), \(X_2\), provided the frequency \(\omega\) is known.
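Going the other way, the amplitude and phase can be recovered from the fitted coefficients. A short worked step, using only the definitions \(\alpha_1 = A\cos\phi\) and \(\alpha_2 = A\sin\phi\) and assuming the convention \(A > 0\):
\[\alpha_1^2 + \alpha_2^2 = A^2\left(\cos^2\phi + \sin^2\phi\right) = A^2, \qquad \frac{\alpha_2}{\alpha_1} = \frac{A\sin\phi}{A\cos\phi} = \tan\phi,\]
so that
\[A = \sqrt{\alpha_1^2 + \alpha_2^2}, \qquad \phi = \operatorname{atan2}(\alpha_2, \alpha_1).\]
The two-argument arctangent is used because \(\tan\phi\) alone fixes \(\phi\) only up to a multiple of \(\pi\); the signs of \(\alpha_1\) and \(\alpha_2\) determine the quadrant.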
What do these parameters mean?
An incomplete list of references.